Word Embeddings and Sentiment

In this Colab, you'll work with word embeddings and train a basic neural network to predict text sentiment. By the end, you'll be able to visualize how the network associates sentiment with each word in the dataset.

In [1]:
import tensorflow as tf

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

Get the dataset

We're going to use a dataset containing Amazon and Yelp reviews, each labeled with its sentiment (1 for positive, 0 for negative). The data was originally extracted from the Sentiment Labelled Sentences dataset in the UCI Machine Learning Repository.

In [2]:
!wget --no-check-certificate \
    -O /tmp/sentiment.csv https://drive.google.com/uc?id=13ySLC_ue6Umt9RJYSeM2t-V0kCv-4C-P
--2020-08-08 15:40:32--  https://drive.google.com/uc?id=13ySLC_ue6Umt9RJYSeM2t-V0kCv-4C-P
Resolving drive.google.com (drive.google.com)... 74.125.20.138, 74.125.20.102, 74.125.20.100, ...
Connecting to drive.google.com (drive.google.com)|74.125.20.138|:443... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://doc-08-ak-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/07o4u60mkn0l74qeocivuaa79hm2g351/1596901200000/11118900490791463723/*/13ySLC_ue6Umt9RJYSeM2t-V0kCv-4C-P [following]
Warning: wildcards not supported in HTTP.
--2020-08-08 15:40:32--  https://doc-08-ak-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/07o4u60mkn0l74qeocivuaa79hm2g351/1596901200000/11118900490791463723/*/13ySLC_ue6Umt9RJYSeM2t-V0kCv-4C-P
Resolving doc-08-ak-docs.googleusercontent.com (doc-08-ak-docs.googleusercontent.com)... 74.125.135.132, 2607:f8b0:400e:c01::84
Connecting to doc-08-ak-docs.googleusercontent.com (doc-08-ak-docs.googleusercontent.com)|74.125.135.132|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 127831 (125K) [text/csv]
Saving to: ‘/tmp/sentiment.csv’

/tmp/sentiment.csv  100%[===================>] 124.83K  --.-KB/s    in 0.001s  

2020-08-08 15:40:32 (128 MB/s) - ‘/tmp/sentiment.csv’ saved [127831/127831]

In [3]:
import numpy as np
import pandas as pd

dataset = pd.read_csv('/tmp/sentiment.csv')

sentences = dataset['text'].tolist()
labels = dataset['sentiment'].tolist()

# Separate out the sentences and labels into training and test sets
training_size = int(len(sentences) * 0.8)

training_sentences = sentences[0:training_size]
testing_sentences = sentences[training_size:]
training_labels = labels[0:training_size]
testing_labels = labels[training_size:]

# Make labels into numpy arrays for use with the network later
training_labels_final = np.array(training_labels)
testing_labels_final = np.array(testing_labels)
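
Before tokenizing, it can help to sanity-check the data and the split. This optional cell (not part of the original lab) peeks at the first few rows and the label balance, using only the dataset and splits defined above.

In [ ]:
# Optional sanity check: inspect a few rows, the label balance, and the split sizes
print(dataset.head())
print(dataset['sentiment'].value_counts())
print('Training examples:', len(training_sentences))
print('Testing examples:', len(testing_sentences))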

Tokenize the dataset

Tokenize the dataset, including padding the sequences and handling out-of-vocabulary (OOV) words.

In [4]:
# Tokenization and padding hyperparameters
vocab_size = 1000
embedding_dim = 16
max_length = 100
trunc_type = 'post'
padding_type = 'post'
oov_tok = "<OOV>"

tokenizer = Tokenizer(num_words=vocab_size, oov_token=oov_tok)
tokenizer.fit_on_texts(training_sentences)
word_index = tokenizer.word_index

sequences = tokenizer.texts_to_sequences(training_sentences)
padded = pad_sequences(sequences, maxlen=max_length, padding=padding_type,
                       truncating=trunc_type)

testing_sequences = tokenizer.texts_to_sequences(testing_sentences)
testing_padded = pad_sequences(testing_sequences, maxlen=max_length,
                               padding=padding_type, truncating=trunc_type)
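
As an optional check (not part of the original lab), the cell below confirms the OOV behavior: Keras's Tokenizer assigns the oov_token index 1, and any word outside the kept vocabulary is mapped to that index when converting text to sequences.

In [ ]:
# Optional: the OOV token has index 1, and out-of-vocabulary words map to it
print(word_index[oov_tok])  # 1
# 'flabbergastingly' is a made-up example word that won't be in the vocabulary
print(tokenizer.texts_to_sequences(['this was flabbergastingly good']))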

Review a Sequence

Let's quickly take a look at one of the padded sequences to check that everything above worked as expected.

In [5]:
reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])

def decode_review(text):
    return ' '.join([reverse_word_index.get(i, '?') for i in text])

print(decode_review(padded[1]))
print(training_sentences[1])
good case excellent value ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?
Good case Excellent value.

Train a Basic Sentiment Model with Embeddings

In [6]:
# Build a basic sentiment network.
# Note that the embedding layer comes first, and the output is a single
# node because each label is either 0 or 1 (negative or positive).
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(6, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding (Embedding)        (None, 100, 16)           16000     
_________________________________________________________________
flatten (Flatten)            (None, 1600)              0         
_________________________________________________________________
dense (Dense)                (None, 6)                 9606      
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 7         
=================================================================
Total params: 25,613
Trainable params: 25,613
Non-trainable params: 0
_________________________________________________________________
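
As a quick arithmetic check, the parameter counts above follow directly from the layer shapes: the embedding holds vocab_size * embedding_dim weights, the first dense layer connects the 1,600 flattened values to 6 units (plus 6 biases), and the output layer adds 6 weights and 1 bias.

In [ ]:
# Verify the summary's parameter counts by hand
print(1000 * 16)     # embedding: 16,000
print(1600 * 6 + 6)  # dense:      9,606
print(6 * 1 + 1)     # dense_1:        7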
In [7]:
num_epochs = 10
model.fit(padded, training_labels_final, epochs=num_epochs, validation_data=(testing_padded, testing_labels_final))
Epoch 1/10
50/50 [==============================] - 0s 7ms/step - loss: 0.6922 - accuracy: 0.5229 - val_loss: 0.6945 - val_accuracy: 0.4135
Epoch 2/10
50/50 [==============================] - 0s 3ms/step - loss: 0.6862 - accuracy: 0.5461 - val_loss: 0.6895 - val_accuracy: 0.5238
Epoch 3/10
50/50 [==============================] - 0s 3ms/step - loss: 0.6594 - accuracy: 0.6353 - val_loss: 0.6673 - val_accuracy: 0.6341
Epoch 4/10
50/50 [==============================] - 0s 3ms/step - loss: 0.5825 - accuracy: 0.7872 - val_loss: 0.6074 - val_accuracy: 0.7168
Epoch 5/10
50/50 [==============================] - 0s 3ms/step - loss: 0.4568 - accuracy: 0.8701 - val_loss: 0.5354 - val_accuracy: 0.7769
Epoch 6/10
50/50 [==============================] - 0s 3ms/step - loss: 0.3396 - accuracy: 0.9096 - val_loss: 0.5166 - val_accuracy: 0.7619
Epoch 7/10
50/50 [==============================] - 0s 3ms/step - loss: 0.2592 - accuracy: 0.9366 - val_loss: 0.5089 - val_accuracy: 0.7519
Epoch 8/10
50/50 [==============================] - 0s 3ms/step - loss: 0.2057 - accuracy: 0.9504 - val_loss: 0.4753 - val_accuracy: 0.7820
Epoch 9/10
50/50 [==============================] - 0s 3ms/step - loss: 0.1618 - accuracy: 0.9611 - val_loss: 0.4685 - val_accuracy: 0.7970
Epoch 10/10
50/50 [==============================] - 0s 3ms/step - loss: 0.1297 - accuracy: 0.9743 - val_loss: 0.4707 - val_accuracy: 0.8020
Out[7]:
<tensorflow.python.keras.callbacks.History at 0x7f9f001f85f8>

Get files for visualizing the network

The code below will write out and download two files for visualizing how your network "sees" the sentiment of each word. Head to http://projector.tensorflow.org/, load these files, and then check the "Sphereize data" checkbox.

In [8]:
# First get the weights of the embedding layer
e = model.layers[0]
weights = e.get_weights()[0]
print(weights.shape) # shape: (vocab_size, embedding_dim)
(1000, 16)
In [9]:
import io

# Write out the embedding vectors and metadata
out_v = io.open('vecs.tsv', 'w', encoding='utf-8')
out_m = io.open('meta.tsv', 'w', encoding='utf-8')
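# Index 0 is reserved for padding, so the loop starts at 1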
for word_num in range(1, vocab_size):
  word = reverse_word_index[word_num]
  embeddings = weights[word_num]
  out_m.write(word + "\n")
  out_v.write('\t'.join([str(x) for x in embeddings]) + "\n")
out_v.close()
out_m.close()
In [10]:
# Download the files
try:
  from google.colab import files
except ImportError:
  pass
else:
  files.download('vecs.tsv')
  files.download('meta.tsv')
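
As an optional, offline complement to the projector (this sketch is not part of the original lab), you can also look up a word's nearest neighbors in the embedding space directly, using cosine similarity between rows of the weights matrix:

In [ ]:
# Optional sketch: nearest neighbors in embedding space by cosine similarity
def closest_words(query, top_k=5):
    idx = word_index.get(query)
    if idx is None or idx >= vocab_size:
        return []  # word unknown or outside the kept vocabulary
    v = weights[idx]
    # Cosine similarity between the query vector and every embedding row
    sims = weights @ v / (np.linalg.norm(weights, axis=1) * np.linalg.norm(v) + 1e-9)
    best = np.argsort(-sims)
    # Skip the query word itself (and index 0, which is reserved for padding)
    return [reverse_word_index[i] for i in best[1:top_k + 1] if i in reverse_word_index]

print(closest_words('good'))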

Predicting Sentiment in New Reviews

Now that you've trained and visualized your network, take a look below at how it can predict sentiment in new reviews it has never seen before.

In [11]:
# Use the model to predict a review   
fake_reviews = ['I love this phone', 'I hate spaghetti', 
                'Everything was cold',
                'Everything was hot exactly as I wanted', 
                'Everything was green', 
                'the host seated us immediately',
                'they gave us free chocolate cake', 
                'not sure about the wilted flowers on the table',
                'only works when I stand on tippy toes', 
                'does not work when I stand on my head']

print(fake_reviews) 

# Create the padded sequences
sample_sequences = tokenizer.texts_to_sequences(fake_reviews)
fakes_padded = pad_sequences(sample_sequences, padding=padding_type,
                             maxlen=max_length)

print('\nHOT OFF THE PRESS! HERE ARE SOME NEWLY MINTED, ABSOLUTELY GENUINE REVIEWS!\n')              

classes = model.predict(fakes_padded)

# The closer the class is to 1, the more positive the review is deemed to be
for x in range(len(fake_reviews)):
  print(fake_reviews[x])
  print(classes[x])
  print('\n')

# Try adding reviews of your own
# Add some negative words (such as "not") to the good reviews and see what happens
# For example:
# they gave us free chocolate cake and did not charge us
['I love this phone', 'I hate spaghetti', 'Everything was cold', 'Everything was hot exactly as I wanted', 'Everything was green', 'the host seated us immediately', 'they gave us free chocolate cake', 'not sure about the wilted flowers on the table', 'only works when I stand on tippy toes', 'does not work when I stand on my head']

HOT OFF THE PRESS! HERE ARE SOME NEWLY MINTED, ABSOLUTELY GENUINE REVIEWS!

I love this phone
[0.99074733]


I hate spaghetti
[0.04332022]


Everything was cold
[0.28719485]


Everything was hot exactly as I wanted
[0.6782568]


Everything was green
[0.50050545]


the host seated us immediately
[0.84949315]


they gave us free chocolate cake
[0.8820852]


not sure about the wilted flowers on the table
[0.01858082]


only works when I stand on tippy toes
[0.9718807]


does not work when I stand on my head
[0.01166688]


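If you want hard labels instead of raw probabilities, a common convention is to threshold the sigmoid output at 0.5. The cutoff is a choice, not something the model prescribes; here's a minimal sketch:

In [ ]:
# Optional: turn the sigmoid scores into hard labels.
# The 0.5 cutoff is a conventional choice, not part of the model itself.
for review, score in zip(fake_reviews, classes):
    label = 'positive' if score[0] > 0.5 else 'negative'
    print(f'{score[0]:.3f}  {label:8s}  {review}')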